Introduction

In this task, we cluster the geopol dataset using two techniques: K-means and hierarchical clustering.

K-means algorithm

First, we apply the K-means clustering algorithm. We choose 3 clusters, since this number yields reasonably interpretable groups. We run the algorithm on a scaled version of the data, since the variables are measured on very different scales. Let us visualize the grouped data.
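The scaling and clustering step described above can be sketched as follows. This is a minimal pure-Python illustration, not the code used for the analysis: the six rows are made-up (GDP, illiteracy rate) values rather than geopol data, and the helper names standardize and kmeans, as well as the choice of initial centers, are assumptions made for the example.

```python
import math

def standardize(rows):
    """Scale every column to zero mean and unit (population) standard deviation.
    Assumes no column is constant, otherwise the division would fail."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, init_idx, max_iter=100):
    """Lloyd's algorithm with initial centers taken from the rows in init_idx."""
    centers = [list(points[i]) for i in init_idx]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(len(centers)),
                          key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers

# Hypothetical data: columns are (GDP per capita, illiteracy rate in %).
rows = [
    [1000, 60], [1500, 55],   # low GDP, high illiteracy
    [4000, 5], [5000, 3],     # low GDP, low illiteracy
    [30000, 1], [35000, 2],   # high GDP, low illiteracy
]

scaled = standardize(rows)
labels, centers = kmeans(scaled, init_idx=[0, 2, 4])
print(labels)
```

With these made-up values, each pair of rows ends up in its own cluster. Standardizing first means that GDP, whose raw values are in the thousands, does not dominate the distance computation merely because of its units.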

We can see that the data are well grouped, at least with respect to the illiteracy rate and GDP. There is a clear group of poor, third-world countries with a high illiteracy rate and low GDP (green group). Then there is a group of countries with low GDP but with basic needs met, such as Eastern European countries (blue group). The third group consists of well-developed, mostly Western countries with high GDP (red group).

We can also investigate the other dimensions in the figure below.

We can see that the groups overlap with respect to population. The green group tends to have the lowest values in all of the variables eltp, rnnr, nunh and nuth, which all in some way reflect basic needs, and also in the variable rspo (number of students), which reflects the educational level. The other two groups largely overlap in all of these variables; what really separates them is economic performance.

Hierarchical algorithm

Let’s compare these results with those of hierarchical clustering.

We can see that the results are somewhat different. The algorithm still tries to separate the developed countries from the undeveloped ones, but the separation is not as strong. One group (blue) now consists of only two countries, India and China. Population therefore appears to have had a serious impact on the clustering here, as opposed to the first approach, where population did not contribute to forming the groups at all.
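The effect described above, where two observations that are extreme in a single variable form a cluster of their own, can be reproduced on a toy example. The sketch below is a naive agglomerative procedure in pure Python. The text does not state which linkage was used, so complete linkage is an assumption here, and the eight points are made-up, already-scaled (population, GDP) values rather than geopol data.

```python
import math
from itertools import combinations

def complete_linkage(a, b, points):
    """Distance between two clusters: the largest pairwise point distance."""
    return max(math.dist(points[i], points[j]) for i in a for j in b)

def agglomerate(points, k):
    """Merge the two closest clusters repeatedly until only k remain."""
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > k:
        # combinations yields index pairs (a, b) with a < b.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: complete_linkage(
                clusters[ab[0]], clusters[ab[1]], points),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because b > a
    return clusters

# Hypothetical, already-scaled data: columns are (population, GDP).
points = [
    (-0.5, -0.8), (-0.4, -0.7), (-0.5, -0.6),  # small population, low GDP
    (-0.4, 1.0), (-0.5, 1.2), (-0.3, 1.1),     # small population, high GDP
    (3.0, -0.6), (3.3, -0.7),                  # two very populous countries
]

clusters = agglomerate(points, k=3)
print(clusters)
```

In this toy data the two populous points lie far from everything else along the population axis, so they are merged with each other long before they would be merged with any other group, and cutting at three clusters leaves them as a separate pair.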